## Prerequisites

We will use the Transformers library from Hugging Face, which is pip-installable:

pip install transformers

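The exercises also use PyTorch, NumPy, Matplotlib, and scikit-learn, which can be installed the same way:

pip install torch numpy matplotlib scikit-learn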

## Exercise 1: Tokenization and Embedding Exploration

The aim of this exercise is to visualize how text is broken down into tokens and converted into embeddings. 

1) Create a short ten-word sentence
2) Tokenize it using the tokenizer from the Hugging Face model distilbert-base-uncased
3) Decode the tokens back into words
4) Use the model's embedding layer to project tokens into vectors
5) Visualize the embeddings using PCA
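
Steps 4 and 5 can be illustrated without downloading a model. The sketch below is a toy stand-in, not DistilBERT's actual layer: the five-token vocabulary, the random embedding table, and the names `toy_vocab`, `embedding_table`, and `pca_2d` are all made up for illustration. It shows that an embedding layer is a table lookup by token ID, and that PCA amounts to centering the vectors and projecting them onto the top singular directions.

```python
import numpy as np

# Hypothetical toy vocabulary and randomly initialised embedding table
# (a real model learns this table during training).
toy_vocab = {"[CLS]": 0, "transformers": 1, "are": 2, "amazing": 3, "[SEP]": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(toy_vocab), 16))  # [vocab_size, hidden_dim]

# "Tokenize": map each token to its integer ID.
toy_tokens = ["[CLS]", "transformers", "are", "amazing", "[SEP]"]
token_ids = np.array([toy_vocab[t] for t in toy_tokens])

# Embedding lookup: one row of the table per token ID.
vectors = embedding_table[token_ids]  # [seq_len, hidden_dim]

def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

reduced_toy = pca_2d(vectors)
print(reduced_toy.shape)  # (5, 2)
```

Up to the sign of each axis, this should agree with what `PCA(n_components=2).fit_transform` from scikit-learn returns on the same vectors.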

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Load a small model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Tokenize input
sentence = "Transformers are amazing models for NLP."
tokens = tokenizer(sentence, return_tensors="pt")
input_ids = tokens["input_ids"]
attention_mask = tokens["attention_mask"]

# Show tokenized inputs
print("Input IDs:", input_ids)
print("Attention Mask:", attention_mask)

# Decode input IDs back into tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print("Decoded Tokens:", decoded_tokens)

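# Get contextual embeddings: final-layer hidden states, one vector per token
# (the static input embedding table is available via model.get_input_embeddings())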
with torch.no_grad():
    outputs = model(**tokens)
embeddings = outputs.last_hidden_state.squeeze(0)

# Reduce dimension for visualization
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings.numpy())

plt.figure(figsize=(8, 5))
for i, label in enumerate(decoded_tokens):
    x, y = reduced[i]
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, label)
plt.title("Token Embeddings Visualized via PCA")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()

## Exercise 2: Build Your Own Scaled Dot-Product Attention

In this exercise you implement the attention mechanism from scratch on small data.

1) Generate small random matrices for queries, keys, and values
2) Implement the scaled dot-product attention:

$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V $

3) Visualize the attention weights as a heatmap
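
The division by $\sqrt{d_k}$ in the formula can be motivated numerically. The sketch below assumes that query and key components are independent with zero mean and unit variance, in which case the dot product has variance $d_k$. This is an idealisation: the `np.random.rand` inputs used in this exercise are not zero-mean. The helper name `score_std` is hypothetical.

```python
import numpy as np

def score_std(d_k, scaled, n_samples=10_000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    scores = np.sum(q * k, axis=1)  # one dot product per sample
    if scaled:
        scores = scores / np.sqrt(d_k)
    return scores.std()

# Compare the spread of raw and scaled scores for a few head sizes.
for d_k in (4, 16, 64):
    raw = score_std(d_k, scaled=False)
    scaled = score_std(d_k, scaled=True)
    print(f"d_k={d_k:3d}  raw std={raw:.2f}  scaled std={scaled:.2f}")
```

The raw spread should grow roughly like $\sqrt{d_k}$ while the scaled spread stays near 1. Large scores push the softmax towards a near one-hot distribution, which is what the scaling is meant to avoid.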

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create random Q, K, V
np.random.seed(0)
Q = np.random.rand(3, 4) # Queries
K = np.random.rand(3, 4) # Keys
V = np.random.rand(3, 4) # Values

# Scaled dot-product attention
d_k = Q.shape[1]
scores = Q @ K.T / np.sqrt(d_k)
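exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))  # subtract row max for numerical stability
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # row-wise softmax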
output = weights @ V

# Print the attention weights and output
print("Scaled Dot-Product Scores:\n", scores)
print("Attention Weights (softmax):\n", weights)
print("Output:\n", output)

# Plot the attention weights as a heatmap
plt.figure(figsize=(6, 5))
plt.imshow(weights, cmap='viridis')
plt.colorbar(label="Attention Weight")
plt.title("Attention Weights Heatmap")
plt.xlabel("Key Index")
plt.ylabel("Query Index")
plt.xticks([0, 1, 2])
plt.yticks([0, 1, 2])
plt.grid(False)
plt.show()

## Exercise 3: Multi-Head Attention 

This exercise shows how multi-head attention works by implementing a simplified version with synthetic data.

Repeat Exercise 2 with a synthetic input of 3 tokens, each with an 8-dimensional embedding, using 2 attention heads (so each head works in 4 dimensions).
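
An alternative to keeping a separate weight matrix per head is to project once to the full `d_model` width and then split the result into heads by reshaping. The NumPy sketch below assumes that layout. The names `split_heads` and `softmax`, the random weights, and the final output projection `W_o` are illustrative: `W_o` follows the original Transformer formulation and is not part of this exercise.

```python
import numpy as np

num_tokens, d_model, num_heads = 3, 8, 2
d_k = d_model // num_heads  # 4 dimensions per head

rng = np.random.default_rng(0)
X = rng.random((num_tokens, d_model))
# Illustrative random weights; a trained model would learn these.
W_q, W_k, W_v, W_o = (rng.random((d_model, d_model)) for _ in range(4))

def split_heads(M):
    """[num_tokens, d_model] -> [num_heads, num_tokens, d_k]."""
    return M.reshape(num_tokens, num_heads, d_k).transpose(1, 0, 2)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Batched attention: every head is handled by the same matrix products.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # [num_heads, num_tokens, num_tokens]
weights = softmax(scores)
heads = weights @ V                                # [num_heads, num_tokens, d_k]

# Concatenate heads back to [num_tokens, d_model], then mix them with W_o.
concat = heads.transpose(1, 0, 2).reshape(num_tokens, d_model)
output = concat @ W_o
print(weights.shape, output.shape)  # (2, 3, 3) (3, 8)
```

Head `h` here uses columns `h * d_k` to `(h + 1) * d_k` of each projection, so the result matches a per-head loop over those column slices.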

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Set dimensions
num_tokens = 3     # sequence length
d_model = 8        # total embedding dimension
num_heads = 2
d_k = d_model // num_heads  # dimension per head

# Synthetic input: 3 tokens, each with 8-d embedding
torch.manual_seed(0)
X = torch.rand((num_tokens, d_model))  # [3, 8]

# Linear projections for Q, K, V per head (manually for clarity)
def project(X, W):
    return X @ W.T

# Create projection weights: 2 heads, each with separate Q, K, V
W_q = torch.rand((num_heads, d_k, d_model))
W_k = torch.rand((num_heads, d_k, d_model))
W_v = torch.rand((num_heads, d_k, d_model))

# Compute attention for each head
attn_outputs = []
attn_weights_all = []

for h in range(num_heads):
    Q = project(X, W_q[h])
    K = project(X, W_k[h])
    V = project(X, W_v[h])
    
    scores = Q @ K.T / (d_k ** 0.5)  # Scaled dot-product
    weights = F.softmax(scores, dim=-1)
    output = weights @ V
    
    attn_outputs.append(output)
    attn_weights_all.append(weights)

# Concatenate the outputs from all heads
multi_head_output = torch.cat(attn_outputs, dim=-1)

# Print the result
print("Multi-Head Output:\n", multi_head_output)

# Visualize attention weights
fig, axes = plt.subplots(1, num_heads, figsize=(12, 4))
for i, weights in enumerate(attn_weights_all):
    ax = axes[i]
    ax.imshow(weights.detach().numpy(), cmap='viridis')
    ax.set_title(f"Head {i+1} Attention")
    ax.set_xlabel("Key Index")
    ax.set_ylabel("Query Index")
    ax.set_xticks(range(num_tokens))
    ax.set_yticks(range(num_tokens))
plt.tight_layout()
plt.show()

## Exercise 4: Explore Attention on a Sentence

Here we will see how each word in a sentence attends to the other words in context.

1) Input a sentence into the DistilBERT model
2) Extract the attention weights from one or more layers
3) Use a heat map to visualize attention across words

Q. In your sentence, which words focus on which others?

Q. How does this vary between layers?
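
To help answer these questions: with `output_attentions=True`, `outputs.attentions` is a tuple with one tensor per layer, each of shape `[batch, num_heads, seq_len, seq_len]`. The sketch below uses random stand-in arrays of that shape instead of real model output. The 6 layers and 12 heads match `distilbert-base-uncased`, while the token list and the helper name `strongest_targets` are assumptions for illustration. It reports, for each token, which token it attends to most after averaging over heads.

```python
import numpy as np

# Stand-in for outputs.attentions: random row-normalised weights,
# NOT real model output. Shapes: 6 layers x [1, 12, seq_len, seq_len].
token_labels = ["[CLS]", "the", "cat", "sat", "[SEP]"]  # hypothetical tokens
seq_len = len(token_labels)
rng = np.random.default_rng(0)
fake_attentions = []
for _ in range(6):
    raw = rng.random((1, 12, seq_len, seq_len))
    fake_attentions.append(raw / raw.sum(axis=-1, keepdims=True))

def strongest_targets(layer_attention, labels):
    """For each query token, return the key token with the highest head-averaged weight."""
    mean_over_heads = layer_attention[0].mean(axis=0)  # [seq_len, seq_len]
    best = mean_over_heads.argmax(axis=-1)             # best key index per query
    return {labels[q]: labels[k] for q, k in enumerate(best)}

# Compare the first and last layer.
for layer_idx in (0, 5):
    print(f"Layer {layer_idx}:", strongest_targets(fake_attentions[layer_idx], token_labels))
```

With the real model, the same function should work on each layer's tensor after converting it with `.numpy()`, using the labels from `tokenizer.convert_ids_to_tokens`.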

In [None]:
import torch
import matplotlib.pyplot as plt
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased", output_attentions=True)

sentence = "Transformers capture contextual relationships."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

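attentions = outputs.attentions[0][0]  # First layer, first item in the batch: [num_heads, seq_len, seq_len]
token_labels = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # Labels for both axes
plt.imshow(attentions[0].numpy(), cmap='viridis')  # Head 0: rows are queries, columns are keys
plt.xticks(range(len(token_labels)), token_labels, rotation=90)
plt.yticks(range(len(token_labels)), token_labels)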
plt.title("Self-Attention Heatmap (Head 0, Layer 0)")
plt.colorbar()
plt.show()